The dataset was downloaded from https://www.kaggle.com/center-for-policing-equity/data-science-for-good. It contains 2383 rows and 47 columns. I first checked for missing values and replaced any that were present with the column mean. The dataset contains both categorical and numerical variables, and the categorical variables were converted to numeric type. After this, the data was explored and analysed graphically.
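The missing-value handling and encoding described above can be sketched as follows. This is a minimal illustration on a made-up data frame (`toy` and its columns are hypothetical, not the report's actual code), and it assumes that mean imputation applies only to numeric columns.

```r
# Hypothetical toy data standing in for the real dataset
toy <- data.frame(
  years  = c(2, NA, 10, 4),
  gender = c("Male", "Female", "Male", "Male"),
  stringsAsFactors = FALSE
)

# Count missing values per column
na_counts <- colSums(is.na(toy))

# Replace missing numeric values with the column mean
num_cols <- sapply(toy, is.numeric)
toy[num_cols] <- lapply(toy[num_cols], function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
})

# Convert categorical columns to integer codes (levels sorted alphabetically)
cat_cols <- sapply(toy, is.character)
toy[cat_cols] <- lapply(toy[cat_cols], function(x) as.integer(factor(x)))
```

Integer codes impose an arbitrary order on the categories, so they suit counting and plotting better than modelling.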
Preprocessing is a vital step before any analysis. In this stage, I converted the date, time, and hour values into a 1-hour format.
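One way to derive hourly bins and calendar labels from a timestamp is sketched below. The input strings and their format are assumptions made for illustration; the real date and time columns may be stored differently. The names `time1h`, `dayname`, and `monthname` follow the column names that appear later in this report.

```r
# Hypothetical timestamps; the real columns and their formats may differ
stamps <- c("2016-09-03 04:14:00", "2016-03-22 20:45:00")
dt <- as.POSIXct(stamps, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")

toy <- data.frame(
  time1h    = as.POSIXlt(dt)$hour,  # hour of day, 0-23
  dayname   = weekdays(dt),         # day name (locale-dependent)
  monthname = months(dt)            # month name (locale-dependent)
)
```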
A graph of officer gender is plotted to show which gender is more likely to be injured. The graph shows that male officers are more likely to be injured than female officers.
## PIE CHART

The pie chart for subject race has the categories White, Black, Hispanic, and Other. White is the largest single group at more than 40 percent, while the remaining groups (Black, Hispanic, and Other) together account for 54 percent.
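The shares behind a pie chart like this one can be computed with `prop.table()`. The sketch below uses made-up labels and base R's `pie()` for brevity, rather than the plotting code used for the report.

```r
# Made-up race labels for illustration only
subject_race <- c(rep("White", 5), rep("Black", 3),
                  rep("Hispanic", 1), rep("Other", 1))

# Share of each group, in percent
shares <- round(100 * prop.table(table(subject_race)), 1)

# Label each slice with its name and percentage
pie(shares,
    labels = paste0(names(shares), " (", shares, "%)"),
    main = "Subject race")
```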
A time series graph of incident date and time is plotted to show its trend. Time series analysis is a statistical technique for analysing and interpreting data points collected over time, typically to identify patterns, trends, and relationships between variables.
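A daily incident count, the basis of a time series plot like this one, can be sketched in base R as follows. The dates are made up, and the variable names are hypothetical.

```r
# Made-up incident dates for illustration only
incident_date <- as.Date(c("2016-01-01", "2016-01-01", "2016-01-02",
                           "2016-01-04", "2016-01-04", "2016-01-04"))

# Count incidents per day
daily <- as.data.frame(table(incident_date), stringsAsFactors = FALSE)
names(daily) <- c("date", "n")
daily$date <- as.Date(daily$date)

# Line plot of the daily counts
plot(daily$date, daily$n, type = "l",
     xlab = "Incident date", ylab = "Incidents")
```

Days with no incidents do not appear in `daily`, so the line connects only the days that have at least one record.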
The values in the OFFICER_RACE column are counted. White officers appear 1470 times, Hispanic officers 482 times, and Black officers 341 times.
## OFFICER_RACE
## American Ind        Asian        Black     Hispanic        Other        White
##            8           55          341          482           27         1470
The officer gender column contains two values, Male and Female. Male appears 2143 times, whereas Female appears 240 times.
## OFFICER_GENDER
## Female   Male
##    240   2143
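The counts can also be expressed as percentages. The sketch below re-enters the two counts from the table above by hand; the variable names are hypothetical.

```r
# Counts taken from the OFFICER_GENDER table above
officer_gender <- c(Female = 240, Male = 2143)

# Percentage share of each gender, rounded to one decimal place
share <- round(100 * officer_gender / sum(officer_gender), 1)
print(share)
```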
Officer years on force are plotted to see which gender has more years on force. The graph suggests that male officers have more years on force than female officers.
The graph shows how many times male and female officers appear in the column. The second line of code creates the bar chart by adding a geom_bar layer with stat = "count", which counts the number of officers in each gender category. The binwidth = 0.1 argument has no effect: geom_bar() does not accept binwidth and ignores it, as the warning below shows (bar width is set with width instead). The fill = "dark green" argument makes the bars dark green.
## Warning in geom_bar(stat = "count", binwidth = 0.1, fill = "dark green"):
## Ignoring unknown parameters: `binwidth`
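A minimal sketch of a count bar chart that avoids this warning is shown below. It uses a made-up data frame (`toy` is hypothetical) and sets the bar width with `width`, which is the argument `geom_bar()` accepts.

```r
library(ggplot2)

# Made-up gender labels for illustration only
toy <- data.frame(OFFICER_GENDER = c("Male", "Male", "Female", "Male"))

# geom_bar() counts rows per category by default; `width` sets the bar width
p <- ggplot(toy, aes(x = OFFICER_GENDER)) +
  geom_bar(width = 0.5, fill = "darkgreen") +
  theme_minimal()
print(p)
```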
A bar chart of OFFICER_ID is plotted to visualize how many incidents each officer appears in.
ggplotly(
  ggplot(dat, aes(x = OFFICER_ID)) +
    geom_bar(fill = "dark red") +  # counts rows by default; binwidth is not a geom_bar() argument
    theme_minimal()
)
A count plot of subject gender and race is plotted. The graph shows that Black subjects appear most often among males, and the same is true among females.
q3 <- ggplot(data = subset(dat, !is.na(time1h)), aes(x = factor(time1h))) +
  geom_bar(fill = "#a13864") +
  scale_x_discrete(breaks = 0:23) +
  xlab("Hour") + ylab("Number of Offences") +
  ggtitle("Number of Offences by Hour")
The number of offences per hour is plotted in this graph. Most offences take place in the 20th hour, that is, around 8 pm. The data for the figure is a subset of the larger data frame dat that excludes rows where time1h is missing (NA).
The ggplot() function initialises the plot from this subset, and aes() sets the x variable to time1h, converted to a factor.
The geom_bar() function adds bars to the plot with the fill colour "#a13864".
library(gridExtra)
grid.arrange(q3)
## GRAPHS

The number of offences per month is plotted. The graph shows in which month most offences take place: the majority occurred in March, whereas December had the fewest.
q5labels = c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")
q5 = ggplot(data = dat, aes(x = monthname)) +
  geom_bar(fill = "#946994") +
  xlab("Month") +
  scale_x_discrete(labels = q5labels) +
  ylab("no of offences") + ggtitle("Number of offences by month")
grid.arrange(q5)
## BOX PLOT

The ggplot() function initialises the plot with the dat data frame, and aes() maps OFFICER_RACE to the x axis and OFFICER_YEARS_ON_FORCE, converted to numeric with as.numeric(), to the y axis.
The geom_boxplot() function adds a box plot layer showing the distribution of OFFICER_YEARS_ON_FORCE for each value of OFFICER_RACE. The y axis uses a log10 scale, so values of 0 become infinite and are dropped, which explains the warnings after the code.
ggplotly(
  ggplot(dat, aes(x = OFFICER_RACE, y = as.numeric(OFFICER_YEARS_ON_FORCE))) +
    geom_boxplot() +
    scale_y_log10() +  # log scale; zero values become -Inf and are dropped
    ggtitle(label = "Officer years on force by officer race")
)
## Warning: Transformation introduced infinite values in continuous y-axis
## Warning: Removed 3 rows containing non-finite values (`stat_boxplot()`).
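The warnings about infinite values come from taking the logarithm of zero. The sketch below, on made-up values, shows the effect and two possible ways to handle it. Neither is the approach used in this report; which one fits depends on whether officers with zero years on force should appear in the plot.

```r
# Made-up years-on-force values, including a zero
years <- c(0, 1, 5, 12)

# The zero becomes -Inf, which a log-scaled axis cannot draw
logged <- log10(years)

# Option 1: exclude zeros before plotting
kept <- years[years > 0]

# Option 2: shift by one so that zero maps to zero
shifted <- log10(years + 1)
```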
# Keep rows with a known subject race and one of six selected force types
q9 <- dat %>%
  filter(SUBJECT_RACE != "NULL",
         TYPE_OF_FORCE_USED1 %in% c("Verbal Command", "Weapon display at Person", "Held Suspect Down", "BD - Grabbed", "Take Down - Arm", "Taser")) %>%
  # Count each race and force type pair, then compute shares within each race
  count(SUBJECT_RACE, TYPE_OF_FORCE_USED1) %>%
  group_by(SUBJECT_RACE) %>%
  mutate(prop = n / sum(n)) %>%
  # Stacked bars scaled to 100% per race
  ggplot(aes(x = SUBJECT_RACE, y = prop, fill = TYPE_OF_FORCE_USED1)) +
  geom_bar(stat = "identity", position = "fill") +
  scale_y_continuous(labels = scales::percent) +
  xlab("Subject race") + ylab("Proportion") +
  guides(fill = guide_legend(title = "Type of force")) +
  scale_x_discrete(labels = c("AI", "A", "B", "H", "O", "W")) +
  ggtitle("Type of force used on races") +
  scale_fill_brewer(palette = "BrBG")
q9
A map is plotted using the latitude and longitude columns of the data frame.
library(ggmap)
library(rgdal)
library(ggplot2)
library(Rcpp)
#install.packages("sf")
library(sf)
#install.packages("leaflet")
library(leaflet)
library(leaflet.extras)
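# Convert coordinates to numeric; entries that cannot be parsed become NA
dat$lat <- as.numeric(dat$LOCATION_LATITUDE)
dat$long <- as.numeric(dat$LOCATION_LONGITUDE)
# One circle marker per incident, with the first type of force used as the popup
dat %>%
  leaflet() %>%
  addTiles() %>%
  addCircleMarkers(popup = ~TYPE_OF_FORCE_USED1, radius = ~ sqrt(7))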
## Assuming "long" and "lat" are longitude and latitude, respectively
## Warning in validateCoords(lng, lat, funcName): Data contains 55 rows with
## either missing or invalid lat/lon values and will be ignored
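Rows without usable coordinates can also be counted and dropped explicitly before mapping, instead of relying on the mapping function to ignore them. The sketch below uses a made-up data frame (`toy` is hypothetical) with the same `lat` and `long` column names.

```r
# Made-up coordinates for illustration only; NA marks a missing value
toy <- data.frame(
  lat  = c(32.78, NA, 32.80),
  long = c(-96.80, -96.75, NA)
)

# TRUE for rows where both coordinates are present
has_coords <- complete.cases(toy[, c("lat", "long")])

# Number of rows that cannot be placed on the map
n_dropped <- sum(!has_coords)

# Rows that can be mapped
mappable <- toy[has_coords, ]
```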
map <- leaflet(dat) %>%
  # Base groups
  addTiles(group = "OSM (default)") %>%
  addProviderTiles(providers$Stamen.TonerLite, group = "Toner Lite")
# Overlay groups: one layer of circles per officer race.
# Rows are selected with dat$OFFICER_RACE so the code does not depend on attach(dat).
map <- map %>%
  addCircles(popup = "White", data = dat[dat$OFFICER_RACE == "White", ],
             group = "White", color = "#d73027") %>%
  addCircles(popup = "Black", data = dat[dat$OFFICER_RACE == "Black", ],
             group = "Black", color = "#f46d43") %>%
  addCircles(popup = "Hispanic", data = dat[dat$OFFICER_RACE == "Hispanic", ],
             group = "Hispanic", color = "#fdae61") %>%
  addCircles(popup = "American Ind", data = dat[dat$OFFICER_RACE == "American Ind", ],
             group = "American Ind", color = "#ffffbf")
## Assuming "long" and "lat" are longitude and latitude, respectively
## Warning in validateCoords(lng, lat, funcName): Data contains 1 rows with either
## missing or invalid lat/lon values and will be ignored
## Assuming "long" and "lat" are longitude and latitude, respectively
## Warning in validateCoords(lng, lat, funcName): Data contains 7 rows with either
## missing or invalid lat/lon values and will be ignored
## Assuming "long" and "lat" are longitude and latitude, respectively
## Warning in validateCoords(lng, lat, funcName): Data contains 10 rows with
## either missing or invalid lat/lon values and will be ignored
## Assuming "long" and "lat" are longitude and latitude, respectively
## Warning in validateCoords(lng, lat, funcName): Data contains 35 rows with
## either missing or invalid lat/lon values and will be ignored
map %>% addLayersControl(
  baseGroups = c("OSM (default)", "Toner Lite"),
  overlayGroups = c("White", "Black", "Hispanic", "American Ind"),
  options = layersControlOptions(collapsed = TRUE))
The graphs below show how the number of offences is related to time, day of the week, and month. From them we can see on which day of the week most offences take place, and how many occur on each day.
q3labels = as.character(seq(0, 23, by = 1))
q3 = ggplot(data = subset(dat, !is.na(time1h)), aes(x = time1h)) +
  geom_bar(fill = "#a13864") +
  scale_x_discrete(labels = q3labels) +
  xlab("Hour") + ylab("no of offences") + ggtitle("Number of offences by hour")
# One panel per weekday: the same plot, filtered by day, with its own fill colour
day_fills = c(Monday = "#41b6c4", Tuesday = "#2da3ce", Wednesday = "#508dcc", Thursday = "#7a72b9",
              Friday = "#975494", Saturday = "#a13864", Sunday = "#583e59")
q6 = lapply(names(day_fills), function(day) {
  ggplot(data = subset(dat, !is.na(time1h) & dayname == day), aes(x = time1h)) +
    geom_bar(fill = day_fills[[day]]) +
    scale_x_discrete(labels = q3labels) +
    xlab(day) + theme(axis.text.x = element_blank()) + ylab("no of offences")
})
grid.arrange(grobs = q6, top = "Number of offences")
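The busiest day and hour can also be read from a cross-tabulation rather than from the panels. The sketch below uses a made-up data frame (`toy` is hypothetical) with the same `dayname` and `time1h` column names.

```r
# Made-up incidents for illustration only
toy <- data.frame(
  dayname = c("Monday", "Monday", "Friday", "Friday", "Friday"),
  time1h  = c(20, 20, 8, 20, 21)
)

# Rows are days, columns are hours, cells are incident counts
counts <- table(toy$dayname, toy$time1h)

# Totals per day and per hour
by_day  <- rowSums(counts)
by_hour <- colSums(counts)

busiest_day  <- names(which.max(by_day))
busiest_hour <- names(which.max(by_hour))
```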
This completes the exploratory data analysis and visualization of the dataset. The graphs show on which day, in which month, and at what time most offences take place. They also show which group is more subject to injury and which type of force is used most. All preprocessing steps were completed before this analysis. The plots, including histograms, bar plots, box plots, and a time series, were produced with ggplot2 and plotly, and the map with leaflet.